Dynamic applications in ViEWS serve as important forecasts in their
own right. Since the data are inherently serially correlated (in space,
in time, and in both), it makes sense to consider dynamic
forecasts as a baseline.
Beyond the density baselines proposed as part of ViEWS 2.0, the forecasts
proposed here provide Bayesian density forecasts that allow for the
evaluation of dynamics (autoregression, random walks [with and without
drift], and time trends) and of different distributional assumptions
(e.g., Poisson, negative binomial, zero-inflated, Tweedie). The idea
here is that these forecasts should be baselines in the sense that
no-change or density forecasts account for the basic properties of the
data (beyond which covariates can then be considered via LASSO or
elastic-net additions, which we leave to others).
The goal here is to take a fully Bayesian approach that admits that the conflict forecaster is uncertain about the following choices:
Data in the sample / training set: this is in the main addressed by the design of the forecasting competition itself.
Choice of the parametric density (e.g., Poisson, compound Poisson, negative binomial). A review of the papers from the previous event shows that this is a question worth considering, especially in light of the forecast baseline being a simple Poisson distribution (estimated on the training sample chosen above).
Specification of the dynamics: autoregression, local trends, global trends, and the relative weighting of each. In earlier analyses of the pgm and cm data, various approaches to nearness in space-time were considered to gauge how serially correlated the predicted events may be. The idea here is to benchmark and systematize how one thinks about this, at least in the time dimension.
Evaluation of the forecast densities: focus on distributions, not on single quantiles or summary statistics (means, medians). This requires thinking about comparing the relative costs via Murphy diagrams, CRPS, DRPS, etc.
A critical consideration: being “Bayesian” in this context is not to claim that a single class or set of prior beliefs exists about the data, the forecast models, or the parameters alone (in these models). The idea is rather that models and probabilistic statements are open to interpretation and to the weighting of evidence about what one more or less sees as true (for approaches to this see (Gill 2021; McElreath 2018)). So an assessment based on in- and out-of-sample comparisons will, and ought to, be made that looks at the relative properties of, and beliefs about, the performance of models and their forecast densities.
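Since sample-based scores such as the CRPS recur throughout what follows, here is a minimal, self-contained sketch of the energy-form estimator \(E|X-y| - \frac{1}{2}E|X-X'|\) on toy Poisson draws (not the ViEWS series; the helper name `crps_from_draws` is ours — this is the same quantity `scoringutils::crps_sample` computes from forecast samples):

```r
# Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|, with X, X' independent
# draws from the forecast density (smaller is better).
crps_from_draws <- function(draws, y) {
  mean(abs(draws - y)) - 0.5 * mean(abs(outer(draws, draws, "-")))
}

set.seed(123)
y_obs   <- 7
sharp   <- rpois(2000, lambda = 7)   # forecast density centered on the truth
diffuse <- rpois(2000, lambda = 25)  # badly mis-centered forecast density
crps_from_draws(sharp, y_obs)    # small
crps_from_draws(diffuse, y_obs)  # much larger
```

The score rewards forecast densities that are both calibrated and sharp, which is why it is preferred below to point summaries such as the mean or median.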
This file and the code here estimate the basic models reported and selected in the analyses. The code is not meant to be fast — it runs serially — and does not include any optimizations for running in parallel (e.g., running the same model with \(k\) different PDFs on separate cores). It is meant to be illustrative, though it is what is used for the model selection steps of the subsequent Bayesian density forecasts.
We begin by setting up the training data as provided by ViEWS. This takes the data as given from ViEWS and reads it into several data frames and some subsets.
These data are compressed, meaning rows exist only for observations that were actually recorded.
If you run your own version of this, you will likely need to change the paths to the input files. You can get the input files here.
# For the parquet files
library(arrow)
##
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
##
## timestamp
# CM level data: features and outcomes
cmf <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2017.parquet")
#cmf17 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2017.parquet")
cmf18 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2018.parquet")
#cmf19 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2019.parquet")
#cmf20 <- read_parquet("~/ViEWS2/shared_competition_data/cm_features_to_oct2020.parquet")
# CM level data: outcomes
cma <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2018.parquet")
#cma18 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2018.parquet")
#cma19 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2019.parquet")
#cma20 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2020.parquet")
#cma21 <- read_parquet("~/ViEWS2/shared_competition_data/cm_actuals_2021.parquet")
countries <- read.csv("~/ViEWS2/shared_competition_data/country_list.csv")
# Save
save.image("ViEWS2-training.RData")
This will not work with some of the smoothing spline methods, which
cannot handle missing data implicitly: they need to be told that data
are missing and where. To address this, the following code grabs the
main identifiers from the cm data and pads it out
appropriately.
# Start by making a sub-sample, since we want to just test code at this stage
# so this will make things faster.
x <- cmf[,45:95] # Some features
y <- cmf$ged_sb # Main QoI
yl <- cmf[,10:15] # Lags of QoI
dt1 <- as.data.frame(cbind(as.factor(cmf$country_id), cmf$month_id, y, yl, x))
colnames(dt1) <- c("series", "time", "y", colnames(yl), colnames(x))
# Cleanup
rm(y,x,yl)
Now we need to check the completeness of the data / time balances.
This is necessary because the mvgam package has
requirements for this to make the mgcv smoothing splines
for the time series models of interest here.
library(tidyr)
# Data setup check step from mvgam() : below is direct from the package
all_times_avail = function(time, min_time, max_time){
  identical(as.numeric(sort(time)),
            as.numeric(seq.int(from = min_time, to = max_time)))
}
# Data defn step
data_train <- dt1
min_time <- min(data_train$time)
max_time <- max(data_train$time)
data.frame(series = data_train$series,
           time = data_train$time) %>%
  dplyr::group_by(series) %>%
  dplyr::summarise(all_there = all_times_avail(time,
                                               min_time,
                                               max_time)) -> checked_times
# if(any(checked_times$all_there == FALSE)){
# stop('One or more series in "data" is missing observations
# for one or more timepoints', call. = FALSE)
# }
# Look at the results
# table(checked_times)
# Observations in the training data per country
# rowSums(table(dt1$series, dt1$time))
# These are the ones that are complete (as compared to those in checked_times)
allt <- checked_times$series[checked_times$all_there==TRUE]
So at this stage one faces a modeling choice in order to use the
mvgam models: either
drop all of the incomplete series (i.e., those where data are not recorded over some time periods), or
pad the dataset with explicitly missing rows.
# Full, padded data (the padding option; built here for reference but
# superseded below)
tmp <- expand.grid(series = unique(dt1$series),
                   time = unique(dt1$time),
                   KEEP.OUT.ATTRS = FALSE)
tmp <- expand.grid(series = allt, time = unique(dt1$time),
                   KEEP.OUT.ATTRS = FALSE)
# Not adding all of the empty covariates (yet)
dt2 <- merge(tmp, dt1[,1:9], all.x = TRUE)
# The option used in what follows: keep only the complete series
dt2 <- (dt1[dt1$series %in% allt,])
dt2$series <- droplevels(dt2$series)
# Here's the check
# Note that the data are fully padded out here (same was not true above):
rowSums(table(dt2$series, dt2$time, useNA = "always"))
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 52 53 54 55 58 60 62 64 66 67 69 70 73 74 76 77
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 78 79 80 81 82 85 87 89 90 93 94 96 97 99 100 101
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 104 105 107 108 109 112 116 118 119 120 121 127 128 129 130 132
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 133 135 136 138 139 140 142 143 145 146 147 148 149 150 151 152
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 153 154 155 156 157 158 159 160 161 162 164 165 166 167 168 169
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 171 172 173 174 177 178 179 180 181 182 183 198 199 205 206 213
## 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334 334
## 214 218 220 222 223 234 235 237 243 244 <NA>
## 334 334 334 334 334 334 334 334 334 334 0
The estimators in this section use basic unit-specific models and allow for different dynamics and distributional functional forms. In this section the above is extended to look at the following:
Poisson
Zero-inflated Poisson (to better model the “excess” or common null reporting of non-events).
Negative binomial models: these fit the same mean function as the Poisson model, but they allow a variance specification so that the variance of the counts can exceed the mean number of events per unit.
Compound Poisson / Tweedie models. These are less well-known generalizations of the above (at least among social scientists, who are probably more familiar with (King 1989)). Details are in P. K. Dunn and Smyth (2005, 2008) and P. Dunn (2017).
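The dispersion issue behind this list of densities is easy to see by simulation: at the same mean, negative binomial draws have a much larger variance and far more zeros than Poisson draws. A toy illustration (arbitrary mean and size parameter, not the GED counts themselves):

```r
set.seed(42)
mu <- 10
y_pois <- rpois(1e5, lambda = mu)            # variance = mean
y_nb   <- rnbinom(1e5, size = 0.5, mu = mu)  # variance = mu + mu^2/size = 210

c(mean(y_pois), var(y_pois))  # both close to 10
c(mean(y_nb),   var(y_nb))    # mean close to 10, variance close to 210
mean(y_pois == 0)             # essentially no zeros
mean(y_nb == 0)               # a large share of exact zeros
```

A density that forces variance = mean will badly understate the spread (and the zeros) of conflict-event counts, which is exactly what the dispersed families above are meant to repair.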
dt1 <- dt1[,1:10] # Only grab the first 10 col
# library(mvgam)
# library(lme4)
# library(gamm4)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
library(glmmTMB)
## Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
## glmmTMB was built with TMB version 1.9.4
## Current TMB version is 1.9.6
## Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
Fits local Poisson and tensor Poisson models.
# Poisson models -- simple
psn.global <- gam(y ~ ged_sb_tlag_1 + s(time),
                  data=dt1,
                  family = "poisson",
                  method = "REML")
summary(psn.global)
##
## Family: poisson
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + s(time)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.872e+00 1.031e-03 2786 <2e-16 ***
## ged_sb_tlag_1 1.569e-04 1.175e-07 1336 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(time) 8.999 9 417735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -0.193 Deviance explained = 9.1%
## -REML = 5.0097e+06 Scale est. = 1 n = 62474
plot(psn.global, pages=1)
psn.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                 data=dt1,
                 family = "poisson",
                 discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : algorithm did not converge
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : fitted rates numerically 0 occurred
summary(psn.local)
##
## Family: poisson
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.831e+03 1.022e+03 -4.729 2.26e-06 ***
## ged_sb_tlag_1 -2.871e-05 3.635e-07 -78.967 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(time,series) 872.9 311 3.048e+11 0.864
##
## Rank: 2130/2132
## R-sq.(adj) = -9.34e+05 Deviance explained = -1.57e+03%
## fREML = 9.7123e+05 Scale est. = 1 n = 62474
plot(psn.local, pages=1)
psn.local.te <- bam(y ~ ged_sb_tlag_1 +
                      te(time, series, bs="fs"),
                    data=dt1,
                    family = "poisson",
                    discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : fitted rates numerically 0 occurred
summary(psn.local.te)
##
## Family: poisson
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.361e+03 0.000e+00 -Inf <2e-16 ***
## ged_sb_tlag_1 -2.163e-05 3.399e-07 -63.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## te(time,series) 476.7 892 528711971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 1065/1067
## R-sq.(adj) = 0.31 Deviance explained = 81.2%
## fREML = 1.0936e+06 Scale est. = 1 n = 62474
Fits local zero inflated Poisson and tensor zero inflated Poisson models.
# ZIP Models -- simple
zip.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                 data=dt1,
                 family = ziP(),
                 discrete=TRUE)
## Warning in bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,
## : algorithm did not converge
summary(zip.local)
##
## Family: Zero inflated Poisson(-0.719,0)
## Link function: identity
##
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000e+00 0.000e+00 NaN NaN
## ged_sb_tlag_1 -3.310e-05 4.127e-07 -80.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time,series) 1356 1712 2.089e+09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 2114/2132
## Deviance explained = 75%
## fREML = 7.5093e+05 Scale est. = 1 n = 62474
plot(zip.local, pages=1)
# With tensors?
zip.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                    data=dt1,
                    family = ziP(),
                    discrete=TRUE)
summary(zip.local.te)
##
## Family: Zero inflated Poisson(-0.827,0.086)
## Link function: identity
##
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.017e+03 1.499e+03 -4.015 5.95e-05 ***
## ged_sb_tlag_1 -2.690e-05 3.825e-07 -70.324 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(time,series) 455.6 104 1.42e+08 0.863
##
## Deviance explained = 71.6%
## fREML = 8.4175e+05 Scale est. = 1 n = 62474
Fits local negative binomial and tensor negative binomial models.
nb.local <- bam(y ~ ged_sb_tlag_1 + s(time, series, bs="fs"),
                data=dt1,
                family = "nb",
                discrete=TRUE)
summary(nb.local)
##
## Family: Negative Binomial(0.263)
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.1216480 0.2764095 -18.529 < 2e-16 ***
## ged_sb_tlag_1 0.0007774 0.0001007 7.721 1.17e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time,series) 938.5 2009 9.012 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -6.43e+28 Deviance explained = 98.5%
## fREML = 66637 Scale est. = 1 n = 62474
plot(nb.local, pages=1)
# With tensors?
nb.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                   data=dt1,
                   family = "nb",
                   discrete=TRUE)
summary(nb.local.te)
##
## Family: Negative Binomial(0.213)
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.9713374 0.2688804 -18.49 <2e-16 ***
## ged_sb_tlag_1 0.0011771 0.0001128 10.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(time,series) 679.2 1029 15.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = -8.21e+44 Deviance explained = 98.8%
## fREML = 66240 Scale est. = 1 n = 62474
These are versions from (Adelson 1966), later extended to GLMs by (Jorgensen 1987). Suppose our counts are \(Y\) (suppressing subscripts for countries and time here). For the class of EDMs (exponential dispersion models) defined by Jorgensen, the Tweedie family (Tweedie et al. 1984) has the property that, for a count distribution,
\(V(Y) = \phi \cdot g(E(Y))\)
where \(V(\cdot)\) denotes the variance, \(\phi\) is a dispersion parameter, and \(g(\cdot)\) is the variance function governing the mean-to-variance relationship. Here that is taken to be
\(g(E(Y)) = E(Y)^p\)
so that \(V(Y) = \phi E(Y)^p\): a power law between the mean and the variance. The overdispersed (quasi-Poisson or NB1-style) count model falls out as the special case where \(\phi > 1\) and \(p=1\), since it simply rescales the variance relative to the mean. But there are other special cases that come from the Tweedie / compound Poisson formulation. When \(p=0\) the model reduces in a GLM setup to a normal distribution. When \(p=1\) and \(\phi=1\) this is the Poisson distribution for \(Y\). For \(p=2\) the process is gamma distributed, and for \(p=3\), \(Y\) is inverse Gaussian.^[These are well known in the statistics, medical, epidemiology, and ecology modeling communities. There are easy to use tools for this as well, e.g. from 2014.] In the regression-like approach introduced above, this adds dispersion and power parameters, with the range \(1 < p < 2\) being of most interest.
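The power-variance relationship and the \(1 < p < 2\) behavior can be checked directly by simulation with mgcv's rTweedie, which draws from the compound Poisson-gamma case. The particular values of the mean, \(p\), and \(\phi\) below are arbitrary choices for illustration:

```r
library(mgcv)  # rTweedie() simulates the compound Poisson-gamma case, 1 < p < 2

set.seed(99)
mu <- 5; p <- 1.6; phi <- 2
y  <- rTweedie(rep(mu, 1e5), p = p, phi = phi)

mean(y)       # close to mu = 5
var(y)        # close to phi * mu^p = 2 * 5^1.6, about 26
mean(y == 0)  # positive: exact zeros occur, unlike the gamma (p = 2) case
```

This point mass at zero alongside a continuous positive part is what makes the \(1 < p < 2\) range attractive for event counts with many null months.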
Fits local Tweedie and tensor Tweedie models.
tw.local <- bam(y ~ ged_sb_tlag_1 +
                  s(time, series, bs="fs"),
                data=dt1,
                family = tw(),
                discrete=TRUE)
summary(tw.local)
##
## Family: Tweedie(p=1.581)
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + s(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.517e+00 4.383e-01 -19.430 <2e-16 ***
## ged_sb_tlag_1 1.580e-05 1.065e-05 1.484 0.138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time,series) 858.8 2011 13.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.269 Deviance explained = 90.1%
## fREML = 1.4014e+05 Scale est. = 10.515 n = 62474
plot(tw.local, pages=1)
# With tensors?
tw.local.te <- bam(y ~ ged_sb_tlag_1 + te(time, series, bs="fs"),
                   data=dt1,
                   family = tw(),
                   discrete=TRUE)
summary(tw.local.te)
##
## Family: Tweedie(p=1.59)
## Link function: log
##
## Formula:
## y ~ ged_sb_tlag_1 + te(time, series, bs = "fs")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.105e+00 4.094e-01 -19.80 < 2e-16 ***
## ged_sb_tlag_1 5.245e-05 1.064e-05 4.93 8.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(time,series) 617.3 1029 24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.193 Deviance explained = 88%
## fREML = 1.4396e+05 Scale est. = 11.972 n = 62474
Here we report AIC values for the models estimated above.
AIC(psn.global)
## [1] 10019306
AIC(psn.local)
## [1] 183433165
AIC(psn.local.te)
## [1] 2102028
#AIC(nb.global)
AIC(nb.local)
## [1] 95305.3
AIC(nb.local.te)
## [1] 97434.81
AIC(zip.local)
## [1] 1379695
AIC(zip.local.te)
## [1] 1559792
AIC(tw.local)
## [1] 95147.76
AIC(tw.local.te)
## [1] 97384.88
This section fits the GLMM versions of the models over the densities: Poisson, negative binomial, and Tweedie.
# # GLMER version
#
# psn.glmer <- glmer(y ~ (time | series), data=dt1, family="poisson")
#
# #nb.glmer <- glmer.nb(y ~ (time | series), data=dt1)
#
# # GAMM / GAMM 4 -- these need work since the groups are not right.
#
# psn.gamm <- gamm4(y ~ ged_sb_tlag_1 + s(time),
# random=~(time|series),
# data=dt1,
# family = "poisson")
#
# summary(psn.gamm$gam)
# summary(psn.gamm$mer)
#
# plot(psn.gamm$gam)
# plot(psn.gamm$mer)
#
# system.time(psn1.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
# random=list(series=~time),
# data=dt1,
# family = "poisson"))
#
# system.time(nb.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
# random=list(series=~1),
# data=dt1,
# family = "nb", niterPQL = 100))
#
# plot(nb.gamm$gam, pages=1)
# plot(nb.gamm$lme, pages=1)
#
# summary(nb.gamm$gam)
# summary(nb.gamm$lme)
#
#
# system.time(tw.gamm <- gamm(y ~ ged_sb_tlag_1 + s(time),
# random=list(series=~1),
# data=dt1,
# family = "tw", niterPQL = 100))
#
# plot(tw.gamm$gam, pages=1)
# plot(tw.gamm$lme, pages=1)
#
# summary(tw.gamm$gam)
# summary(tw.gamm$lme)
#
# system.time(psn.ar1.gamm <- gamm(y ~ s(time),
# correlation = corAR1(form=~-1|series),
# data=dt1,
# family = "poisson", niterPQL = 100))
# summary(psn.ar1.gamm$gam)
# summary(psn.ar1.gamm$lme)
#
# plot(psn.ar1.gamm$gam, pages=1)
# plot(psn.ar1.gamm$lme, pages=1)
#
# glmmTMB versions
# Need time as a factor for the next command...
dt1$time <- factor(dt1$time)
# This one will crash...
# options(glmmTMB.cores=4)
# psn.glmmtmb <- glmmTMB(y ~ (time|series), family = poisson,
# data=dt1)
psnar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = poisson,
                          data=dt1, control=glmmTMBControl(parallel = 4))
nbar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = nbinom1(),
                         data=dt1)
cmpar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = compois(),
                          data=dt1)
twar1.glmmtmb <- glmmTMB(y ~ ar1(time + 0|series), family = tweedie(),
                         data=dt1)
# Make a table of in-sample fits
library(bbmle)
## Loading required package: stats4
AICtab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb,
delta =FALSE, base=TRUE)
## AIC df
## twar1.glmmtmb 91328.1 5
## nbar1.glmmtmb 91383.6 4
## cmpar1.glmmtmb 92489.4 4
## psnar1.glmmtmb 93946.6 3
AICctab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb)
## dAICc df
## twar1.glmmtmb 0.0 5
## nbar1.glmmtmb 55.5 4
## cmpar1.glmmtmb 1161.3 4
## psnar1.glmmtmb 2618.5 3
BICtab(psnar1.glmmtmb, nbar1.glmmtmb, cmpar1.glmmtmb, twar1.glmmtmb)
## dBIC df
## twar1.glmmtmb 0.0 5
## nbar1.glmmtmb 46.5 4
## cmpar1.glmmtmb 1152.2 4
## psnar1.glmmtmb 2600.4 3
summary(twar1.glmmtmb)
## Family: tweedie ( log )
## Formula: y ~ ar1(time + 0 | series)
## Data: dt1
##
## AIC BIC logLik deviance df.resid
## 91328.1 91373.3 -45659.0 91318.1 62469
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev. Corr
## series time121 125.4 11.2 0.99 (ar1)
## Number of obs: 62474, groups: series, 213
##
## Dispersion parameter for tweedie family (): 5.13
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.1305 0.6384 -22.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(nbar1.glmmtmb)
## Family: nbinom1 ( log )
## Formula: y ~ ar1(time + 0 | series)
## Data: dt1
##
## AIC BIC logLik deviance df.resid
## 91383.6 91419.8 -45687.8 91375.6 62470
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev. Corr
## series time121 54.24 7.365 0.99 (ar1)
## Number of obs: 62474, groups: series, 213
##
## Dispersion parameter for nbinom1 family (): 34
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1276 0.3806 -21.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Among the Poisson models
AICtab(psn.global, psn.local, psn.local.te)
## dAIC df
## psn.local.te 0.0 479.2
## psn.global 7917277.8 11
## psn.local 181331136.9 876
AICtab(nb.local, nb.local.te)
## dAIC df
## nb.local 0.0 943.4
## nb.local.te 2129.5 683.9
Basic model selection begins with a GAMM that includes simple
(time-lagged) effects and deterministic effects for a global smoothing
trend and local trends. These are called the local models
in what follows. These local models are estimated for the
various distributions outlined above. To compare the in-sample fit, AIC
statistics are ranked below.
# Compare distributions
# AICtab(psn.local, zip.local, nb.local, tw.local)
# AICctab(psn.local, zip.local, nb.local, tw.local)
ICtab(psn.local, zip.local, nb.local, tw.local,
base=TRUE, delta=FALSE)
## AIC df
## tw.local 95147.8 863.9
## nb.local 95305.3 943.4
## zip.local 1379695.5 1360.9
## psn.local 183433165.3 876
Clearly the models that admit some dispersion of the event counts are strongly preferred.^[The same is true using BIC or any other related information criterion.]
One might then ask: how much better or worse do these models do as one changes the dynamic specification of the trends? This is seen by looking at tensor products of the GAM splines for the global and local trends (a temporally sparser model with better degrees of freedom). Even in this case, things are no better, and in fact worse, than the specifications discussed above:
# Compare distributions
AICtab(psn.local.te, zip.local.te, nb.local.te, tw.local.te)
## dAIC df
## tw.local.te 0.0 622.1
## nb.local.te 49.9 683.9
## zip.local.te 1462407.6 462.9
## psn.local.te 2004643.5 479.2
AICctab(psn.local.te, zip.local.te, nb.local.te, tw.local.te)
## dAICc df
## tw.local.te 0.0 622.1
## nb.local.te 52.6 683.9
## zip.local.te 1462401.9 462.9
## psn.local.te 2004638.4 479.2
ICtab(psn.local.te, zip.local.te, nb.local.te, tw.local.te,
base=TRUE, delta=FALSE)
## AIC df
## tw.local.te 97384.9 622.1
## nb.local.te 97434.8 683.9
## zip.local.te 1559792.4 462.9
## psn.local.te 2102028.4 479.2
If one instead changes the dynamic specification and moves the dynamics into the latent space (as is done in the GAMM- and GLMM-type models), the in-sample AIC values improve further:
AIC(twar1.glmmtmb)
## [1] 91328.07
AIC(nbar1.glmmtmb)
## [1] 91383.6
AIC(cmpar1.glmmtmb)
## [1] 92489.35
AIC(psnar1.glmmtmb)
## [1] 93946.59
(Note: not all of the GAMM and GLMM model variants available in
R give consistent reporting or computation of the AIC or
likelihood-based fit measures.)
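Given that caveat, one sanity check is to recompute AIC by hand from logLik wherever a model class supplies a logLik method. A minimal sketch on a toy lm fit (the helper name aic_by_hand is ours; the same check applies to any of the fitted objects above that report a log-likelihood):

```r
# AIC = -2 * logLik + 2 * df, recomputed as a cross-check against
# package-reported values when reporting is inconsistent.
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

fit <- lm(dist ~ speed, data = cars)  # toy model with a logLik method
aic_by_hand(fit)  # matches AIC(fit)
```

When a package's AIC and a hand computation from its own logLik disagree, the fit measures should not be compared across packages.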
The code below generates plots of the models' one-step predictions.
# Plot local trends for each modeling specification
#
# pdf(file="gam-local-in-sample.pdf")
par(mfcol=c(2,2))
plot(psn.local, ylab = "Poisson trends")
abline(h=0, col="red", lwd=2)
plot(zip.local, ylab = "ZiP trends")
abline(h=0, col="red", lwd=2)
plot(nb.local, ylab = "Negative binomial trends")
abline(h=0, col="red", lwd=2)
plot(tw.local, ylab = "Tweedie trends")
abline(h=0, col="red", lwd=2)
# dev.off()
# pdf(file="gam-tensor-in-sample.pdf")
par(mfcol=c(2,2))
vis.gam(psn.local.te, ylab = "Poisson tensor trends")
vis.gam(zip.local.te, ylab = "ZiP tensor trends")
vis.gam(nb.local.te, ylab = "Negative binomial tensor trends")
vis.gam(tw.local.te, ylab = "Tweedie tensor trends")
# dev.off()
# Predictions in-sample
local.preds <- cbind(dt1[,1:2],
                     predict(psn.local, type = "response"),
                     predict(nb.local, type = "response"),
                     # predict(zip.local, type = "response"),
                     predict(tw.local, type = "response"))
colnames(local.preds) <- c("series", "time", "P", "NB", "TW")
# Tensor preds
tensor.preds <- cbind(dt1[,1:2],
                      predict(psn.local.te, type = "response"),
                      predict(nb.local.te, type = "response"),
                      predict(tw.local.te, type = "response"))
colnames(tensor.preds) <- c("series", "time", "P", "NB", "TW")
# GLMM preds
glmm.preds <- cbind(dt1[,1:2],
                    predict(psnar1.glmmtmb, type = "response"),
                    predict(nbar1.glmmtmb, type = "response"),
                    # predict(cmpar1.glmmtmb, type = "response"),
                    predict(twar1.glmmtmb, type = "response"))
colnames(glmm.preds) <- c("series", "time", "P", "NB", "TW")
# # GAMM Preds
# gamm.preds <- cbind(dt1[,1:2],
# exp(fitted(psn1.gamm$lme)),
# exp(fitted(psn.ar1.gamm$lme)),
# exp(fitted(nb.gamm$lme)),
# exp(fitted(tw.gamm$lme)))
# colnames(gamm.preds) <- c("series", "time", "P", "P-AR", "NB", "TW")
Here we show how to compute the CRPS for a single time period for each forecast model.
# Get last obs for comparison of an "in-sample"
lastobs <- dt1[dt1$time==454,]
local.last <- local.preds[local.preds$time==454,]
tensor.last <- tensor.preds[tensor.preds$time==454,]
glmm.last <- glmm.preds[glmm.preds$time==454,]
#gamm.last <- gamm.preds[gamm.preds$time==454,]
# Make draws from the relevant modal predictions
# Local preds in-sample pdf
N <- 1000; # Number of draws
n <- dim(lastobs)[1]
k <- 1; # number of forecasts
set.seed(1234)
local.P.fc <- sapply(1:n, function(i) {rpois(N, local.last$P[i])})
theta <- exp(nb.local$family$getTheta())
local.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=local.last$NB[i])})
local.TW.fc <- t(replicate(N, rTweedie(local.last$TW, p=1.581)))
# Tensor preds in-sample pdf
tensor.P.fc <- sapply(1:n, function(i) {rpois(N, tensor.last$P[i])})
theta <- exp(nb.local.te$family$getTheta())
tensor.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tensor.last$NB[i])})
tensor.TW.fc <- t(replicate(N, rTweedie(tensor.last$TW,
p=tw.local.te$family$getTheta(TRUE))))
# glmm preds in-sample pdf
glmm.P.fc <- sapply(1:n, function(i) {rpois(N, glmm.last$P[i])})
glmm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=glmm.last$NB[i])})
glmm.TW.fc <- t(replicate(N, rTweedie(glmm.last$TW, p=1.36)))
# gamm preds in-sample pdf
# gamm.P.fc <- sapply(1:n, function(i) {rpois(N, gamm.last$P[i])})
# gamm.PAR.fc <- sapply(1:n, function(i) {rpois(N, gamm.last$`P-AR`[i])})
# gamm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=gamm.last$NB[i])})
# gamm.TW.fc <- t(replicate(N, rTweedie(gamm.last$TW, p=(tw.gamm$gam$family$getTheta(TRUE)))))
# crps
library(scoringutils)
cat("Poisson, local:", sum(crps_sample(lastobs$y, t(local.P.fc))), "\n")
## Poisson, local: 1778.035
cat("NB, local :", sum(crps_sample(lastobs$y, t(local.NB.fc))), "\n")
## NB, local : 3458.133
cat("Tweedie, local:", sum(crps_sample(lastobs$y, t(local.TW.fc))), "\n")
## Tweedie, local: 1593.813
cat("Poisson, tensor:", sum(crps_sample(lastobs$y, t(tensor.P.fc))), "\n")
## Poisson, tensor: 2463.162
cat("NB, tensor :",sum(crps_sample(lastobs$y, t(tensor.NB.fc))), "\n")
## NB, tensor : 3910.693
cat("Tweedie, tensor:",sum(crps_sample(lastobs$y, t(tensor.TW.fc))), "\n")
## Tweedie, tensor: 3276.774
cat("Poisson, glmm :",sum(crps_sample(lastobs$y, t(glmm.P.fc))), "\n")
## Poisson, glmm : 63.32284
cat("NB, glmm :",sum(crps_sample(lastobs$y, t(glmm.NB.fc))), "\n")
## NB, glmm : 460.1975
cat("Tweedie, glmm :",sum(crps_sample(lastobs$y, t(glmm.TW.fc))), "\n")
## Tweedie, glmm : 243.5304
# cat("Poisson, gamm :",sum(crps_sample(lastobs$y, t(gamm.P.fc))), "\n")
# cat("PoissonAR, gamm:",sum(crps_sample(lastobs$y, t(gamm.PAR.fc))), "\n")
# cat("NB, gamm :",sum(crps_sample(lastobs$y, t(gamm.NB.fc))), "\n")
# cat("Tweedie, gamm :",sum(crps_sample(lastobs$y, t(gamm.TW.fc))), "\n")
We are not going to pay the cost of re-estimation here: we just roll the models forward to the end of 2018. This provides the 2018 forecasts based only on data through October 2017.
local.P.2018 <- local.NB.2018 <- local.TW.2018 <- vector(mode = "list", length=14)
k1 <- 454
for(i in 1:14)
{
  # Local models
  old.p <- apply(local.P.fc, 2, mean)
  tmp.p <- predict(psn.local, data.frame(series=local.last[,1],
                                         time = rep((k1+i), 191),
                                         ged_sb_tlag_1 = old.p),
                   type = "response")
  local.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
  old.nb <- apply(local.NB.fc, 2, mean)
  tmp.nb <- predict(nb.local, data.frame(series=local.last[,1],
                                         time = rep((k1+i), 191),
                                         ged_sb_tlag_1 = old.nb),
                    type = "response")
  theta <- exp(nb.local$family$getTheta())
  local.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})
  old.tw <- apply(local.TW.fc, 2, mean)
  tmp.tw <- predict(tw.local, data.frame(series=local.last[,1],
                                         time = rep((k1+i), 191),
                                         ged_sb_tlag_1 = old.tw),
                    type = "response")
  local.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))
  local.P.2018[[i]] <- local.P.fc
  local.NB.2018[[i]] <- local.NB.fc
  local.TW.2018[[i]] <- local.TW.fc
}
# Now the same for the other models...
tensor.P.2018 <- tensor.NB.2018 <- tensor.TW.2018 <- vector(mode = "list", length=14)
k1 <- 454
for(i in 1:14)
{
  # Tensor models
  old.p <- apply(tensor.P.fc, 2, mean)
  tmp.p <- predict(psn.local.te, data.frame(series=tensor.last[,1],
                                            time = rep((k1+i), 191),
                                            ged_sb_tlag_1 = old.p),
                   type = "response")
  tensor.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
  old.nb <- apply(tensor.NB.fc, 2, mean)
  tmp.nb <- predict(nb.local.te, data.frame(series=tensor.last[,1],
                                            time = rep((k1+i), 191),
                                            ged_sb_tlag_1 = old.nb),
                    type = "response")
  theta <- exp(nb.local.te$family$getTheta())
  tensor.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})
  old.tw <- apply(tensor.TW.fc, 2, mean)
  tmp.tw <- predict(tw.local.te, data.frame(series=tensor.last[,1],
                                            time = rep((k1+i), 191),
                                            ged_sb_tlag_1 = old.tw),
                    type = "response")
  tensor.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))
  tensor.P.2018[[i]] <- tensor.P.fc
  tensor.NB.2018[[i]] <- tensor.NB.fc
  tensor.TW.2018[[i]] <- tensor.TW.fc
}
# glmm -- these are hard to do, since they do not easily admit more time
glmm.P.2018 <- glmm.NB.2018 <- glmm.TW.2018 <- vector(mode = "list", length=14)
k1 <- 454
for(i in 1:14)
{
  # glmm models
  old.p <- apply(glmm.P.fc, 2, mean)
  tmp.p <- predict(psnar1.glmmtmb, newdata=data.frame(series=glmm.last[,1],
                                                      time = as.factor(rep((k1+i), 191)),
                                                      ged_sb_tlag_1 = old.p),
                   type = "response", allow.new.levels=TRUE)
  glmm.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
  old.nb <- apply(glmm.NB.fc, 2, mean)
  tmp.nb <- predict(nbar1.glmmtmb, data.frame(series=glmm.last[,1],
                                              time = as.factor(rep((k1+i), 191)),
                                              ged_sb_tlag_1 = old.nb),
                    type = "response", allow.new.levels=TRUE)
  # theta <- exp(nbar1.glmmtmb$family$getTheta())
  glmm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=34, mu=tmp.nb[i])})
  old.tw <- apply(glmm.TW.fc, 2, mean)
  tmp.tw <- predict(twar1.glmmtmb, data.frame(series=glmm.last[,1],
                                              time = as.factor(rep((k1+i), 191)),
                                              ged_sb_tlag_1 = old.tw),
                    type = "response", allow.new.levels=TRUE)
  glmm.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))
  glmm.P.2018[[i]] <- glmm.P.fc
  glmm.NB.2018[[i]] <- glmm.NB.fc
  glmm.TW.2018[[i]] <- glmm.TW.fc
}
# Not going to do the GAMM here, since the CRPS are so bad
# # gamm
# gamm.P.2018 <- gamm.NB.2018 <- gamm.TW.2018 <- vector(mode = "list", length=14)
#
# k1 <- 454
#
# for(i in 1:14)
# {
# # gamm models
# old.p <- apply(gamm.P.fc, 2, mean)
# tmp.p <- predict(psn1.gamm, data.frame(series=gamm.last[,1],
# time = rep((k1+i), 191),
# ged_sb_tlag_1 = old.p),
# type = "response")
# gamm.P.fc <- sapply(1:n, function(i) {rpois(N, tmp.p[i])})
#
# # old.nb <- apply(gamm.NB.fc, 2, mean)
# # tmp.nb <- predict(nb.local.te, data.frame(series=gamm.last[,1],
# # time = rep((k1+i), 191),
# # ged_sb_tlag_1 = old.nb),
# # type = "response")
# # theta <- exp(nb.local.te$family$getTheta())
# # gamm.NB.fc <- sapply(1:n, function(i) {rnbinom(N, size=theta, mu=tmp.nb[i])})
# #
# # old.tw <- apply(gamm.TW.fc, 2, mean)
# # tmp.tw <- predict(tw.local.te, data.frame(series=gamm.last[,1],
# # time = rep((k1+i), 191),
# # ged_sb_tlag_1 = old.tw),
# # type = "response")
# # gamm.TW.fc <- t(replicate(N, rTweedie(tmp.tw, p=1.581)))
# #
# gamm.P.2018[[i]] <- gamm.P.fc
# # gamm.NB.2018[[i]] <- gamm.NB.fc
# # gamm.TW.2018[[i]] <- gamm.TW.fc
# }
#
# #
# #
Here we save the image of the results of the model estimation and model selection for additional processing in forecast output and comparisons in other files.
save.image("ViEWS2-prelim.RData")